\n", " | ingredient | \n", "description | \n", "
---|---|---|
0 | \n", "gin | \n", "Gin is a distilled alcoholic drink that derive... | \n", "
1 | \n", "vodka | \n", "Vodka is a distilled beverage composed primari... | \n", "
2 | \n", "rum | \n", "Rum is a distilled alcoholic beverage made fro... | \n", "
```python
soup.find('p')
```

```
Machine learning is a set of methods that computers use to make and improve predictions or behaviors based on data.
For example, to predict the value of a house, the computer would learn patterns from past house sales.
The book focuses on supervised machine learning, which covers all prediction problems where we have a dataset for which we already know the outcome of interest (e.g. past house prices) and want to learn to predict the outcome for new data.
Excluded from supervised learning are for example clustering tasks (= unsupervised learning) where we do not have a specific outcome of interest, but want to find clusters of data points.
Also excluded are things like reinforcement learning, where an agent learns to optimize a certain reward by acting in an environment (e.g. a computer playing Tetris).
The goal of supervised learning is to learn a predictive model that maps features of the data (e.g. house size, location, floor type, …) to an output (e.g. house price).
If the output is categorical, the task is called classification, and if it is numerical, it is called regression.
The machine learning algorithm learns a model by estimating parameters (like weights) or learning structures (like trees).
The algorithm is guided by a score or loss function that is minimized.
In the house value example, the machine minimizes the difference between the estimated house price and the predicted price.
A fully trained machine learning model can then be used to make predictions for new instances.

Estimation of house prices, product recommendations, street sign detection, credit default prediction and fraud detection:
All these examples have in common that they can be solved by machine learning.
The tasks are different, but the approach is the same:

Step 1: Data collection.
The more, the better.
The data must contain the outcome you want to predict and additional information from which to make the prediction.
For a street sign detector (“Is there a street sign in the image?”), you would collect street images and label whether a street sign is visible or not.
For a credit default predictor, you need past data on actual loans, information on whether the customers were in default with their loans, and data that will help you make predictions, such as income, past credit defaults, and so on.
For an automatic house value estimator program, you could collect data from past house sales and information about the real estate such as size, location, and so on.

Step 2: Enter this information into a machine learning algorithm that generates a sign detector model, a credit rating model or a house value estimator.

Step 3: Use model with new data.
Integrate the model into a product or process, such as a self-driving car, a credit application process or a real estate marketplace website.

Machines surpass humans in many tasks, such as playing chess (or more recently Go) or predicting the weather.
Even if the machine is as good as a human or a bit worse at a task, there remain great advantages in terms of speed, reproducibility and scaling.
A once implemented machine learning model can complete a task much faster than humans, reliably delivers consistent results and can be copied infinitely.
Replicating a machine learning model on another machine is fast and cheap.
The training of a human for a task can take decades (especially when they are young) and is very costly.
A major disadvantage of using machine learning is that insights about the data and the task the machine solves is hidden in increasingly complex models.
You need millions of numbers to describe a deep neural network, and there is no way to understand the model in its entirety.
Other models, such as the random forest, consist of hundreds of decision trees that “vote” for predictions.
To understand how the decision was made, you would have to look into the votes and structures of each of the hundreds of trees.
That just does not work no matter how clever you are or how good your working memory is.
The best performing models are often blends of several models (also called ensembles) that cannot be interpreted, even if each single model could be interpreted.
If you focus only on performance, you will automatically get more and more opaque models.

The winning models on machine learning competitions are often ensembles of models or very complex models such as boosted trees or deep neural networks.
```
" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.find('p')" ] }, { "cell_type": "markdown", "metadata": { "id": "JSXst7ecfdmr" }, "source": [ "Furthermore, we can use `find_all` to return all elements of a given type in the page as an array and iterate over it, and pull out only the text of each paragraph using the `.text` attribute. Let's look at the first 3 paragraphs:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 33, "status": "ok", "timestamp": 1687981821254, "user": { "displayName": "Myles Harrison", "userId": "02225822412878142066" }, "user_tz": 240 }, "id": "wt647SWLfdms", "outputId": "a820e392-80e1-4179-f5a4-ad89e7854539" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Machine learning is a set of methods that computers use to make and improve predictions or behaviors based on data.\n", "For example, to predict the value of a house, the computer would learn patterns from past house sales.\n", "The book focuses on supervised machine learning, which covers all prediction problems where we have a dataset for which we already know the outcome of interest (e.g. past house prices) and want to learn to predict the outcome for new data.\n", "Excluded from supervised learning are for example clustering tasks (= unsupervised learning) where we do not have a specific outcome of interest, but want to find clusters of data points.\n", "Also excluded are things like reinforcement learning, where an agent learns to optimize a certain reward by acting in an environment (e.g. a computer playing Tetris).\n", "The goal of supervised learning is to learn a predictive model that maps features of the data (e.g. house size, location, floor type, …) to an output (e.g. house price).\n", "If the output is categorical, the task is called classification, and if it is numerical, it is called regression.\n", "The machine learning algorithm learns a model by estimating parameters (like weights) or learning structures (like trees).\n", "The algorithm is guided by a score or loss function that is minimized.\n", "In the house value example, the machine minimizes the difference between the estimated house price and the predicted price.\n", "A fully trained machine learning model can then be used to make predictions for new instances.\n" ] } ], "source": [ "# Check first 3 elements\n", "for elem in soup.find_all(\"p\")[0:2]:\n", " print(elem.text)" ] }, { "cell_type": "markdown", "metadata": { "id": "A0dXcuNRfdms" }, "source": [ "Great, now let's join it all together, and replace the newline characters with spaces, to create one giant string of text:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 29, "status": "ok", "timestamp": 1687981821254, "user": { "displayName": "Myles Harrison", "userId": "02225822412878142066" }, "user_tz": 240 }, "id": "8rlbYAkvfdmt", "outputId": "d5c50c6d-354b-4846-e2f9-9a86cec0db74", "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Machine learning is a set of methods that computers use to make and improve predictions or behaviors based on data. For example, to predict the value of a house, the computer would learn patterns from past house sales. 
The book focuses on supervised machine learning, which covers all prediction problems where we have a dataset for which we already know the outcome of interest (e.g. past house prices) and want to learn to predict the outcome for new data. Excluded from supervised learning are for example clustering tasks (= unsupervised learning) where we do not have a specific outcome of interest, but want to find clusters of data points. Also excluded are things like reinforcement learning, where an agent learns to optimize a certain reward by acting in an environment (e.g. a computer playing Tetris). The goal of supervised learning is to learn a predictive model that maps features of the data (e.g. house size, location, floor type, …) to an output (e.g. house price). If the output is categorical, the task is called classification, and if it is numerical, it is called regression. The machine learning algorithm learns a model by estimating parameters (like weights) or learning structures (like trees). The algorithm is guided by a score or loss function that is minimized. In the house value example, the machine minimizes the difference between the estimated house price and the predicted price. A fully trained machine learning model can then be used to make predictions for new instances. Estimation of house prices, product recommendations, street sign detection, credit default prediction and fraud detection: All these examples have in common that they can be solved by machine learning. The tasks are different, but the approach is the same: Step 1: Data collection. The more, the better. The data must contain the outcome you want to predict and additional information from which to make the prediction. For a street sign detector (“Is there a street sign in the image?”), you would collect street images and label whether a street sign is visible or not. For a credit default predictor, you need past data on actual loans, information on whether the customers were in default with their loans, and data that will help you make predictions, such as income, past credit defaults, and so on. For an automatic house value estimator program, you could collect data from past house sales and information about the real estate such as size, location, and so on. Step 2: Enter this information into a machine learning algorithm that generates a sign detector model, a credit rating model or a house value estimator. Step 3: Use model with new data. Integrate the model into a product or process, such as a self-driving car, a credit application process or a real estate marketplace website. Machines surpass humans in many tasks, such as playing chess (or more recently Go) or predicting the weather. Even if the machine is as good as a human or a bit worse at a task, there remain great advantages in terms of speed, reproducibility and scaling. A once implemented machine learning model can complete a task much faster than humans, reliably delivers consistent results and can be copied infinitely. Replicating a machine learning model on another machine is fast and cheap. The training of a human for a task can take decades (especially when they are young) and is very costly. A major disadvantage of using machine learning is that insights about the data and the task the machine solves is hidden in increasingly complex models. You need millions of numbers to describe a deep neural network, and there is no way to understand the model in its entirety. 
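For instance, a minimal sketch of scraping several pages in a loop might look like the following. Note this is illustrative only: the URL list here is hypothetical, and a polite delay between requests is added:

```python
import time

import requests
from bs4 import BeautifulSoup

# Hypothetical list of pages to scrape
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
]

texts = []
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    # Collect the text of every paragraph on the page
    texts.append(" ".join(p.text for p in soup.find_all("p")))
    time.sleep(1)  # Be polite: pause between requests
```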
## Data Preprocessing

We have now acquired some text. Before using this text in an ML application of NLP, we first need to preprocess the data.

As outlined in the slides, the major steps in preprocessing text are:

- Normalization (addressing case, removing punctuation and stop words, stemming or lemmatization)
- Tokenization (breaking the text up into individual units of language, usually words)
- Vectorization (converting tokens to structured numeric data)

### Normalization

There are a few things we need to do here: *addressing case, removing punctuation, and stemming or lemmatization*. For simplicity's sake, we will not expand contractions (don't, won't, can't, etc.), though this would be another normalization step. We will also only try the simpler technique of stemming, though there are lemmatizers built in to packages such as [nltk](https://www.nltk.org/_modules/nltk/stem/wordnet.html) and [TextBlob](https://textblob.readthedocs.io/en/dev/quickstart.html#words-inflection-and-lemmatization); a small sketch of the lemmatization alternative follows below.
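For reference, a minimal sketch of lemmatization with `nltk`'s WordNet lemmatizer (not used in the rest of this notebook), which requires downloading the WordNet data first:

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # WordNet data required by the lemmatizer

lemmatizer = WordNetLemmatizer()

# Unlike a stemmer, a lemmatizer maps words to real dictionary forms;
# the part-of-speech argument ('v' for verb, 'a' for adjective) guides the lookup
print(lemmatizer.lemmatize('running', pos='v'))  # run
print(lemmatizer.lemmatize('better', pos='a'))   # good
```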
To standardize the case, we simply convert everything to lowercase:

```python
# Convert to lower case
text = text.lower()
text[0:500]
```

```
'machine learning is a set of methods that computers use to make and improve predictions or behaviors based on data. for example, to predict the value of a house, the computer would learn patterns from past house sales. the book focuses on supervised machine learning, which covers all prediction problems where we have a dataset for which we already know the outcome of interest (e.g.\xa0past house prices) and want to learn to predict the outcome for new data. excluded from supervised learning are for'
```

It appears there are also some unicode characters mixed in there, which is never good. Dealing with special characters and different text encodings can be one of the more challenging parts of doing NLP. We will change the encoding to ASCII to address these:

```python
text = text.encode('ASCII', errors='ignore').decode()
text[0:500]
```

```
'machine learning is a set of methods that computers use to make and improve predictions or behaviors based on data. for example, to predict the value of a house, the computer would learn patterns from past house sales. the book focuses on supervised machine learning, which covers all prediction problems where we have a dataset for which we already know the outcome of interest (e.g.past house prices) and want to learn to predict the outcome for new data. excluded from supervised learning are for '
```

That's better: we can see the special characters like `\xa0` that were present before are now gone.
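Note that `errors='ignore'` simply drops any character that cannot be encoded, so accented letters would vanish entirely. A gentler alternative (an aside, not used here) is to decompose the text with `unicodedata` first, so the base letters survive the encoding step:

```python
import unicodedata

s = 'caf\xe9 and\xa0tea'  # 'café' plus a non-breaking space

# NFKD splits 'é' into 'e' + a combining accent and turns \xa0 into
# a plain space, so only the accent marks are lost when encoding
clean = unicodedata.normalize('NFKD', s).encode('ascii', 'ignore').decode()
print(clean)  # cafe and tea
```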
Next we will remove all punctuation. Fortunately, all the punctuation characters are contained in a string stored in Python's `string` base module:

```python
from string import punctuation

print(punctuation)
```

```
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
```

We can now iterate over each punctuation character and update the text, replacing it with the empty string `''`:

```python
for punctuation_mark in punctuation:
    text = text.replace(punctuation_mark, '')
```

```python
# Check the result
text[0:1000]
```

```
'machine learning is a set of methods that computers use to make and improve predictions or behaviors based on data for example to predict the value of a house the computer would learn patterns from past house sales the book focuses on supervised machine learning which covers all prediction problems where we have a dataset for which we already know the outcome of interest egpast house prices and want to learn to predict the outcome for new data excluded from supervised learning are for example clustering tasks unsupervised learning where we do not have a specific outcome of interest but want to find clusters of data points also excluded are things like reinforcement learning where an agent learns to optimize a certain reward by acting in an environment ega computer playing tetris the goal of supervised learning is to learn a predictive model that maps features of the data eghouse size location floor type to an output eghouse price if the output is categorical the task is called classi'
```

We can already see some strange things happening, such as the *e.g.* getting folded into the words "past" and "house" to create the tokens "egpast" and "eghouse". Preprocessing text is not an exact science... we will proceed as-is for now, though there perhaps could have been better ways to tokenize or to deal with problematic portions of this text, such as abbreviations.
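As an aside, the character-by-character replacement loop above can also be written as a single pass with `str.translate`, a common idiom for stripping punctuation:

```python
from string import punctuation

# str.maketrans maps each punctuation character to None, deleting it
text_no_punct = text.translate(str.maketrans('', '', punctuation))
```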
### Stemming

Stemming and lemmatization are built into the very powerful [nltk toolkit](https://www.nltk.org/). Here we choose to do simple stemming using the `SnowballStemmer` (a particular type of stemming algorithm):

```python
from nltk.stem import SnowballStemmer

# Instantiate stemmer for english
sbstem = SnowballStemmer('english')

# Check
sbstem.stem('running')
```

```
'run'
```

Stemmers in `nltk` operate on individual tokens, so we must iterate over the freeform text, then join everything back together again:

```python
# Create an empty list for the stemmed words
stemmed_words = list()

# Iterate over each word and stem and add to new string
for word in text.split(' '):
    stemmed_words.append(sbstem.stem(word))

# Join it all back together and remove any repeated or extraneous spaces
text = ' '.join(stemmed_words).strip()
```

```python
# Check
text[0:500]
```

```
'machin learn is a set of method that comput use to make and improv predict or behavior base on data for exampl to predict the valu of a hous the comput would learn pattern from past hous sale the book focus on supervis machin learn which cover all predict problem where we have a dataset for which we alreadi know the outcom of interest egpast hous price and want to learn to predict the outcom for new data exclud from supervis learn are for exampl cluster task unsupervis learn where we do not hav'
```

We can see that words like *machine* have been stemmed to *machin*, and *improve* to *improv*, and so on, so we appear to have applied stemming correctly. This is enough normalization for now, and we can move on to tokenizing and vectorizing our text.

### Tokenization

Our approach for tokenization could be as simple as splitting on whitespace. As we saw above, we actually already did this step and then undid it, as the `nltk` stemmer works on individual words. Alternatively, we could have applied stemming *after* or as part of tokenization, as the different steps in text preprocessing do not necessarily always come in a fixed order, depending upon the implementation.

Splitting on whitespace is as simple as using the `.split()` method built in to any string variable in base Python:

```python
# Tokenize and show first 10 tokens
text.split(' ')[0:10]
```

```
['machin', 'learn', 'is', 'a', 'set', 'of', 'method', 'that', 'comput', 'use']
```

More sophisticated approaches for tokenization exist; one is sketched below. We actually do not need to do this step manually, as it is included in the vectorization step in `scikit-learn`, as we will see shortly.
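For instance, a minimal sketch of `nltk`'s `word_tokenize` (which requires downloading the `punkt` tokenizer models first), which treats punctuation and contractions as separate tokens rather than folding them into neighbouring words:

```python
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # tokenizer models used by word_tokenize

# Punctuation and contractions become their own tokens,
# e.g. "isn't" is split into 'is' and "n't"
print(word_tokenize("Preprocessing text isn't an exact science."))
```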
### Vectorization

There are two standard types of vectorization used in traditional NLP: *count vectorization* and *term frequency - inverse document frequency (tf-idf)* vectorization. Binary ("one-hot") encoding, with a boolean (0/1) flag for word occurrence in each document, can also be used, though this is less common.

The two former vectorization methods are implemented in `scikit-learn` in the `feature_extraction.text` submodule, and we can apply them as below:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Instantiate, fit and transform - Count vectorization
cv = CountVectorizer()
count_vectorized = cv.fit_transform([text])

# Instantiate, fit and transform - TF-IDF vectorization
tfidf = TfidfVectorizer()
tfidf_vectorized = tfidf.fit_transform([text])
```

Let's take a look at the outputs:

```python
count_vectorized
```

```
<1x263 sparse matrix of type '<class 'numpy.int64'>'
    with 263 stored elements in Compressed Sparse Row format>
```
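The result is a sparse matrix with one row (our single document) and one column per vocabulary term. To inspect the vectors with the vocabulary as column names, one option (a sketch, assuming `pandas` is imported as `pd`) is to convert to a dense array and wrap it in a DataFrame:

```python
import pandas as pd

# One row per document, one column per vocabulary term
pd.DataFrame(count_vectorized.toarray(), columns=cv.get_feature_names_out())
```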
|   | about | act | actual | addit | advantag | agent | algorithm | all | alreadi | also | ... | which | will | win | with | work | wors | would | you | young | your |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 1 | 1 | 1 | 1 | 1 | 3 | 2 | 1 | 2 | ... | 3 | 2 | 1 | 2 | 2 | 1 | 3 | 10 | 1 | 1 |

1 rows × 263 columns
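Similarly for the tf-idf vectors (again a sketch, under the same assumption):

```python
# Same layout, but normalized tf-idf weights instead of raw counts
pd.DataFrame(tfidf_vectorized.toarray(), columns=tfidf.get_feature_names_out())
```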
\n", "\n", " | about | \n", "act | \n", "actual | \n", "addit | \n", "advantag | \n", "agent | \n", "algorithm | \n", "all | \n", "alreadi | \n", "also | \n", "... | \n", "which | \n", "will | \n", "win | \n", "with | \n", "work | \n", "wors | \n", "would | \n", "you | \n", "young | \n", "your | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "0.02557 | \n", "0.012785 | \n", "0.012785 | \n", "0.012785 | \n", "0.012785 | \n", "0.012785 | \n", "0.038355 | \n", "0.02557 | \n", "0.012785 | \n", "0.02557 | \n", "... | \n", "0.038355 | \n", "0.02557 | \n", "0.012785 | \n", "0.02557 | \n", "0.02557 | \n", "0.012785 | \n", "0.038355 | \n", "0.127848 | \n", "0.012785 | \n", "0.012785 | \n", "
1 rows × 263 columns
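As a side note on why these tf-idf values look like scaled counts: with only a single document, `scikit-learn`'s smoothed inverse document frequency is $\ln\frac{1+n}{1+\mathrm{df}(t)} + 1 = \ln\frac{2}{2} + 1 = 1$ for every term, so with the default `norm='l2'` the tf-idf vector is simply the count vector divided by its Euclidean length:

$$\mathrm{tfidf}(t, d) = \frac{\mathrm{count}(t, d)}{\lVert \mathbf{c}_d \rVert_2}$$

This is why, for example, the weight for *you* (count 10) is ten times the weight of the terms that appear only once (0.127848 ≈ 10 × 0.012785). With multiple documents, rarer terms would receive proportionally higher weights.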
\n", "Copyright NLP from scratch, 2024. | \n",
" ![]() | \n",
"
---|